Malaria Detection¶
Problem Definition¶
The context: Why is this problem important to solve?
Malaria is a life-threatening, mosquito-borne infectious disease that affects a large proportion of the world's population. Hundreds of thousands of people are killed by malaria annually, most of them children under 5 years old. Early diagnosis is important because the parasites that cause malaria can survive in the body for up to a year. Malaria can be identified by imaging red blood cells.
The objectives: What is the intended goal?
The goal is to build an efficient model that can identify damaged/parasitized red blood cells. The model will be trained on images of healthy and damaged red blood cells, and it must accurately discern whether a given red blood cell is damaged or healthy.
The key questions: What are the key questions that need to be answered?
The key question is: Which model will most accurately and efficiently identify damaged/parasitized red blood cells?
The problem formulation: What is it that we are trying to solve using data science?
Data science will be used to develop a model that can identify damaged/parasitized red blood cells. The model will be trained on images of healthy and damaged/parasitized red blood cells so that it can learn the distinct patterns of each group. The model should then accurately classify unseen images of red blood cells as damaged/parasitized or healthy.
Data Description ¶
There are a total of 24,958 train and 2,600 test images (colored) obtained from microscopic images. These images belong to the following categories:
Parasitized: The parasitized cells contain the Plasmodium parasite which causes malaria
Uninfected: The uninfected cells are free of the Plasmodium parasites
Mount the Drive
# mount google drive
from google.colab import drive
drive.mount('/content/drive/')
Mounted at /content/drive/
Loading libraries¶
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import glob
from google.colab.patches import cv2_imshow
import zipfile
import cv2
import warnings
warnings.filterwarnings('ignore')
import random
Let us load the data¶
Note:
- You must download the dataset from the link provided on Olympus and upload the same to your Google Drive. Then unzip the folder.
# zip1= zip file with training and test data
zip1= "/content/drive/MyDrive/MIT/Capstone_project/cell_images.zip"
# extract images from zip file
with zipfile.ZipFile(zip1, 'r') as zip2:
    zip2.extractall()
The extracted folder contains separate train and test folders, each with parasitized and uninfected subfolders holding images of varying sizes.
All images must be resized to a common size and converted to 4D arrays so that they can be used as input to the convolutional neural network. We also need to create labels for both types of images in order to train and test the model.
Let's do this for the training data first, then reuse the same code for the test data.
# resize x_train images and save to list
# create labels (y_train)
train_dir= "/content/cell_images/train" # filepath to training folder
folders= ['parasitized', 'uninfected'] # folders within training folder
x_train= [] # list to hold training images
y_train= [] # list to hold training labels
# for loop to store images in x_train and labels in y_train
for i in folders:
    newpath= Path(train_dir) / i # combined filepath of training directory and folder: parasitized or uninfected
    files= newpath.glob('*') # get all files within filepath
    for j in files:
        img1= cv2.imread(str(j)) # read in image from folder (cv2.imread returns None for non-image files)
        if img1 is None:
            continue # skip unreadable files
        img2= cv2.resize(img1, (64, 64)) # resize image to 64 x 64
        x_train.append(img2) # add resized image to list
        if i == folders[0]: # if parasitized y_train label= 0
            y_train.append(0)
        else: # if uninfected y_train label= 1
            y_train.append(1)
# convert y_train from list to series
y_train= pd.Series(y_train, name= 'Labels')
# convert x_train to array
x_train= np.array(x_train)
# resize x_test images and save to list
# create labels (y_test)
test_dir= "/content/cell_images/test" # filepath to test folder
x_test= [] # list to hold test images
y_test= [] # list to hold test labels
# for loop to store images in x_test and labels in y_test
for i in folders:
    newpath2= Path(test_dir) / i # combined filepath of test directory and folder: parasitized or uninfected
    files2= newpath2.glob('*') # get all files within filepath
    for j in files2:
        img3= cv2.imread(str(j)) # read in image from folder (cv2.imread returns None for non-image files)
        if img3 is None:
            continue # skip unreadable files
        img4= cv2.resize(img3, (64, 64)) # resize image to 64 x 64 (matches training image size)
        x_test.append(img4) # add resized image to list
        if i == folders[0]: # if parasitized y_test label= 0
            y_test.append(0)
        else: # if uninfected y_test label= 1
            y_test.append(1)
# convert y_test from list to series
y_test= pd.Series(y_test, name= 'Labels')
# convert x_test to array
x_test= np.array(x_test)
Check the shape of train and test images
# check shape of x_train
print('Shape of x_train is: {}'.format(x_train.shape))
Shape of x_train is: (24958, 64, 64, 3)
# check shape of x_test
print('Shape of x_test is: {}'.format(x_test.shape))
Shape of x_test is: (2600, 64, 64, 3)
Check the shape of train and test labels
# check shape of y_train
print('Shape of y_train is: {}'.format(y_train.shape))
Shape of y_train is: (24958,)
# check shape of y_test
print('Shape of y_test is: {}'.format(y_test.shape))
Shape of y_test is: (2600,)
Observations and insights:
Labels were created for the training and test images. The x_train and x_test images were resized to 64 x 64 x 3 and converted to arrays.
Check the minimum and maximum range of pixel values for train and test images
# find minimum pixel value, maximum pixel value and pixel value range for training images
min_train= np.min(x_train)
max_train= np.max(x_train)
print("For x_train the minimum pixel value is {}".format(min_train))
print("For x_train the maximum pixel value is {}".format(max_train))
print("For x_train the pixel value range is {}".format(max_train - min_train))
For x_train the minimum pixel value is 0
For x_train the maximum pixel value is 255
For x_train the pixel value range is 255
# find minimum pixel value, maximum pixel value and pixel value range for test images
min_test= np.min(x_test)
max_test= np.max(x_test)
print("For x_test the minimum pixel value is {}".format(min_test))
print("For x_test the maximum pixel value is {}".format(max_test))
print("For x_test the pixel value range is {}".format(max_test - min_test))
For x_test the minimum pixel value is 0
For x_test the maximum pixel value is 255
For x_test the pixel value range is 255
Observations and insights: For both the training and test sets, the maximum pixel value is 255 and the minimum pixel value is 0. To normalize, we can divide by 255.
Count the number of values in both uninfected and parasitized
# value counts for training data, 0= parasitized and 1= uninfected
y_train.value_counts()
| Labels | count |
|---|---|
| 0 | 12582 |
| 1 | 12376 |
There are 12,582 images of parasitized red blood cells and 12,376 images of healthy red blood cells in the training image set, so the training set is balanced.
# values for test data, 0= parasitized and 1= uninfected
y_test.value_counts()
| Labels | count |
|---|---|
| 0 | 1300 |
| 1 | 1300 |
There are 1,300 images of parasitized and 1,300 images of healthy red blood cells in the test image set, so the test set is balanced.
Normalize the images
# normalize the training set
x_train= x_train/255
# normalize the test set
x_test= x_test/255
Observations and insights: Based on the label value counts, the data looks balanced. x_train and x_test were normalized by dividing by the maximum pixel value (255).
Plot to check if the data is balanced
# plot the value_counts for training and test to see if data is balanced
names1= ['parasitized', 'uninfected']
fig1, (ax1, ax2)= plt.subplots(1, 2, figsize= (10, 6)) # setup the subplots
ax1.set_title('Training Data') # title for training data barplot
ax1.set_ylabel('Counts') # y label for graph
sns.countplot(x= y_train, ax= ax1)
ax1.set_xticklabels(names1) # relabel bars after plotting so countplot does not overwrite the labels
# for loop to add labels to top of bars
for p in ax1.patches:
    y1= p.get_height() + 50 # get bar height plus offset
    x1= p.get_x() + (p.get_width()/2) - 0.15 # find middle of bar
    total1= y_train.shape[0] # find total number of training labels
    label= str(round(100 * ((y1-50)/total1), 1)) + '%' # format label
    ax1.annotate(label, (x1, y1), size= 12) # annotate the graph with percent
ax2.set_title('Test Data') # title for the test data
ax2.set_ylabel('Counts') # y label for the test data plot
sns.countplot(x= y_test, ax= ax2)
ax2.set_xticklabels(names1) # relabel bars after plotting so countplot does not overwrite the labels
# for loop to add labels to top of bars
for p in ax2.patches:
    y2= p.get_height() + 10 # get bar height plus offset
    x2= p.get_x() + (p.get_width()/2) - 0.15 # find middle of bar
    total2= y_test.shape[0] # find total number of test labels
    label2= str(round(100 * ((y2-10)/total2), 1)) + '%' # format label
    ax2.annotate(label2, (x2, y2), size= 12) # annotate the graph with percent
Observations and insights: Both the training and the test datasets are balanced.
Data Exploration¶
Let's visualize the images from the train data
# look at the parasitized images
randomlist1= np.random.randint(0, 12500, size= 10) # make random list, images to plot for parasitized
count1= 0 # counter
# make figure with 10 images
fig3, ax3= plt.subplots(2, 5, figsize= (10, 4))
fig3.tight_layout()
for i in range(2):
    for j in range(5):
        ax3[i, j].imshow(x_train[randomlist1[count1]]) # plot the images
        if y_train[randomlist1[count1]] == 0: # check the label
            title3= 'parasitized'
        else:
            title3= 'uninfected'
        count1= count1 + 1 # update the counter
        ax3[i, j].set_title(title3) # update title with label
# look at the uninfected images
randomlist2= np.random.randint(13000, 24950, size= 10) # make random list, images to plot for healthy cells
count2= 0 # counter
# make figure with 10 images
fig4, ax4= plt.subplots(2, 5, figsize= (10, 4))
fig4.tight_layout()
for i in range(2):
    for j in range(5):
        ax4[i, j].imshow(x_train[randomlist2[count2]]) # plot the images
        if y_train[randomlist2[count2]] == 0: # check the label
            title4= 'parasitized'
        else:
            title4= 'uninfected'
        count2= count2 + 1 # update the counter
        ax4[i, j].set_title(title4) # update title with label
Observations and insights: The parasitized red blood cells have at least one dark spot, which appears red or purple. Uninfected red blood cells are more uniform in color. The images don't have consistent lighting: some red blood cells are pale while others are shades of purple. This inconsistent lighting appears in both the parasitized and the uninfected images.
Visualize the images with subplot(6, 6) and figsize = (12, 12)
# look at 36 images
randomlist3= np.random.randint(0, 24950, size= 36) # make random list of images
count3= 0 # counter
# make figure with 36 images
fig5, ax5= plt.subplots(6, 6, figsize= (12, 12))
fig5.tight_layout()
for i in range(6):
    for j in range(6):
        ax5[i, j].imshow(x_train[randomlist3[count3]]) # plot the images
        if y_train[randomlist3[count3]] == 0: # check the label
            title5= 'parasitized'
        else:
            title5= 'uninfected'
        count3= count3 + 1 # update the counter
        ax5[i, j].set_title(title5) # update title with label
Observations and insights: The parasitized red blood cells all have at least one dark spot. Lighting is inconsistent: some red blood cells appear pale while others are purple, in both the parasitized and the uninfected images.
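To put a number on the lighting inconsistency noted above, one option is to compare the per-image mean brightness within each class. A minimal sketch, demonstrated here on synthetic data shaped like x_train/y_train (substitute the real arrays to reproduce the check):

```python
import numpy as np

# Synthetic stand-ins for the notebook's x_train / y_train
# (replace with the real arrays; shapes and 0-1 scaling match).
rng = np.random.default_rng(24)
x = rng.random((100, 64, 64, 3))     # 100 fake "images", normalized 0-1
y = np.repeat([0, 1], 50)            # 0 = parasitized, 1 = uninfected

# mean brightness of each image, then the spread (std) within each class;
# a large std would confirm inconsistent lighting
brightness = x.mean(axis=(1, 2, 3))  # one scalar per image
for label, name in [(0, 'parasitized'), (1, 'uninfected')]:
    b = brightness[y == label]
    print(f"{name}: mean {b.mean():.3f}, std {b.std():.3f}")
```

A wide brightness spread in both classes would suggest that a brightness-normalizing step could help the model focus on the dark spots rather than overall illumination.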
Plotting the mean images for parasitized and uninfected
# find all of the parasitized training
y_train_infect= y_train[y_train == 0]
y_train_infect_idx= y_train_infect.index # get index for all infected/parasitized training images
# get the images from x_train
x_train_infect= x_train[y_train_infect_idx] # use index from y_train to get correct x_train images
# find all of the uninfected training
y_train_heal= y_train[y_train == 1]
y_train_heal_idx= y_train_heal.index # get index for all uninfected training images
# get the images from x_train
x_train_heal= x_train[y_train_heal_idx] # use index from y_train to get correct x_train images
Mean image for parasitized
# find the average image for infected
avg_infect= np.average(x_train_infect, axis= 0) # take average of all x_train images in parasitized
fig6= plt.figure(figsize= (2, 2)) # plot average image
plt.imshow(avg_infect)
plt.title('Parasitized_avg_image')
Text(0.5, 1.0, 'Parasitized_avg_image')
Mean image for uninfected
# find the average image for healthy red blood cells
avg_heal= np.average(x_train_heal, axis= 0) # take average image for uninfected
fig7= plt.figure(figsize= (2, 2)) # plot average image for uninfected
plt.imshow(avg_heal)
plt.title('Uninfected_avg_image')
Text(0.5, 1.0, 'Uninfected_avg_image')
Observations and insights: The mean images for parasitized and uninfected look the same; both resemble a purple sphere. Averaging appears to remove the useful feature, the dark spot in the parasitized images, because the spots occur at different locations in each image.
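One way to check whether the class means really are indistinguishable is to look at the pixel-wise difference between them. A minimal sketch on synthetic stand-ins for avg_infect and avg_heal (swap in the real mean images from above):

```python
import numpy as np

# Synthetic stand-ins shaped like the notebook's avg_infect / avg_heal
# (64 x 64 x 3, values in [0, 1]); substitute the real mean images.
rng = np.random.default_rng(24)
avg_a = rng.random((64, 64, 3))
avg_b = rng.random((64, 64, 3))

diff = np.abs(avg_a - avg_b)         # pixel-wise absolute difference
print(f"max difference:  {diff.max():.3f}")
print(f"mean difference: {diff.mean():.3f}")
# A small max/mean difference for the real mean images would confirm
# that averaging washes out the localized dark parasite spots.
```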
Converting RGB to HSV of Images using OpenCV
Converting the train data
# convert from BGR to HSV
hsv_train= [] # empty list to store images
for i in x_train: # for loop to convert images
    bgr1= i*255 # rescale images from 0-1 back to 0-255
    bgr2= bgr1.astype(np.uint8) # convert images to uint8 so cv2 will accept them
    hsv1= cv2.cvtColor(bgr2, cv2.COLOR_BGR2HSV) # change from BGR to HSV
    hsv_train.append(hsv1) # add hsv images to list
# convert from list to array
hsv_train= np.array(hsv_train)
# look at the parasitized hsv images
randomlist1= np.random.randint(0, 12500, size= 10) # make random list, images to plot for parasitized
count1= 0 # counter
# make figure with 10 images
fig10, ax10= plt.subplots(2, 5, figsize= (10, 4)) # define figure and subplots
fig10.tight_layout() # use tight layout
for i in range(2): # loop over rows
    for j in range(5): # loop over columns
        ax10[i, j].imshow(hsv_train[randomlist1[count1]]) # plot the images
        if y_train[randomlist1[count1]] == 0: # check the label
            title10= 'parasitized'
        else:
            title10= 'uninfected'
        count1= count1 + 1 # update the counter
        ax10[i, j].set_title(title10) # update title with label
# look at the uninfected hsv images
randomlist2= np.random.randint(13000, 24950, size= 10) # make random list, images to plot for uninfected
count1= 0 # counter
# make figure with 10 images
fig11, ax11= plt.subplots(2, 5, figsize= (10, 4)) #define subplots in figure
fig11.tight_layout()
for i in range(2): # loop over rows
    for j in range(5): # loop over columns
        ax11[i, j].imshow(hsv_train[randomlist2[count1]]) # plot the images
        if y_train[randomlist2[count1]] == 0: # check the label
            title11= 'parasitized'
        else:
            title11= 'uninfected'
        count1= count1 + 1 # update the counter
        ax11[i, j].set_title(title11) # update title with label
Converting the test data
# convert from BGR to HSV
hsv_test= [] # empty list to store images
for i in x_test: # for loop to convert images
    bgr10= i*255 # rescale images from 0-1 back to 0-255
    bgr20= bgr10.astype(np.uint8) # convert images to uint8 so cv2 will accept them
    hsv10= cv2.cvtColor(bgr20, cv2.COLOR_BGR2HSV) # change from BGR to HSV
    hsv_test.append(hsv10) # add hsv images to list
# convert from list to array
hsv_test= np.array(hsv_test)
Observations and insights: The HSV images for uninfected cells are predominantly blue, pink or purple. The HSV images for parasitized cells have green/yellow spots.
Processing Images using Gaussian Blurring
Gaussian Blurring on train data
# apply gaussian blur to training images
blur_train= [] # empty list
for i in hsv_train: # for loop to blur each training image
    img1= cv2.cvtColor(i, cv2.COLOR_HSV2BGR) # convert hsv back to bgr
    blur1= cv2.GaussianBlur(img1, (5, 5), 0) # apply gaussian blur to image
    blur_train.append(blur1) # append blurred image to list
blur_train= np.array(blur_train)
# look at the parasitized images vs blurred images
randomlist1= np.random.randint(0, 12500, size= 10) # make random list, images to plot for parasitized
count1= 0 # counter
# make figure with 20 images
fig20, ax20= plt.subplots(10, 2, figsize= (4, 25)) # make figure with 20 subplots
fig20.tight_layout(h_pad= 0.5) # add extra padding between subplots
fig20.suptitle('Image vs Gaussian Blurred') # add title to figure
fig20.subplots_adjust(top= 0.95)
for i in range(10): # loop over rows
    for j in range(2): # loop over columns
        if j == 0:
            ax20[i, j].imshow(x_train[randomlist1[count1]]) # plot the original images
            if y_train[randomlist1[count1]] == 0: # check the label
                title20= 'parasitized'
            else:
                title20= 'uninfected'
            ax20[i, j].set_title(title20) # update title with label
        else:
            ax20[i, j].imshow(blur_train[randomlist1[count1]]) # plot the gaussian blurred images
            if y_train[randomlist1[count1]] == 0: # check the label
                title20= 'parasitized \n gaussian blur'
            else:
                title20= 'uninfected \n gaussian blur'
            count1= count1 + 1 # update the counter
            ax20[i, j].set_title(title20) # update title with label
# look at the uninfected images vs blurred images
randomlist2= np.random.randint(13000, 24950, size= 10) # make random list, images to plot for uninfected
count1= 0 # counter
# make figure with 20 images
fig21, ax21= plt.subplots(10, 2, figsize= (4, 25)) # make figure with 20 subplots
fig21.tight_layout(h_pad= 0.5) # add extra padding between subplots
fig21.suptitle('Image vs Gaussian Blurred') # add title to figure
fig21.subplots_adjust(top= 0.95)
for i in range(10): # loop over rows
    for j in range(2): # loop over columns
        if j == 0:
            ax21[i, j].imshow(x_train[randomlist2[count1]]) # plot the original images
            if y_train[randomlist2[count1]] == 0: # check the label
                title21= 'parasitized'
            else:
                title21= 'uninfected'
            ax21[i, j].set_title(title21) # update title with label
        else:
            ax21[i, j].imshow(blur_train[randomlist2[count1]]) # plot the gaussian blurred images
            if y_train[randomlist2[count1]] == 0: # check the label
                title21= 'parasitized \n gaussian blur'
            else:
                title21= 'uninfected \n gaussian blur'
            count1= count1 + 1 # update the counter
            ax21[i, j].set_title(title21) # update title with label
Gaussian Blurring on test data
# apply gaussian blur to test images
blur_test= [] # empty list
for i in hsv_test: # for loop to blur each test image
    img2= cv2.cvtColor(i, cv2.COLOR_HSV2BGR) # convert hsv back to bgr
    blur2= cv2.GaussianBlur(img2, (5, 5), 0) # apply gaussian blur to image
    blur_test.append(blur2) # append blurred image to list
blur_test= np.array(blur_test) # convert to np array
Observations and insights: The training and test images were Gaussian blurred with a 5 x 5 kernel. For both the uninfected and the parasitized red blood cells, the blurred images look very similar to the originals. Gaussian blur did not highlight the key feature, the dark spot in the parasitized images.
Think About It: Would blurring help us for this problem statement in any way? What else can we try?
Not really; Gaussian blurring is mainly used to remove noise, and these images aren't particularly noisy.
Model Building¶
Base Model¶
Note: The Base Model has been fully built and evaluated with all outputs shown to give an idea about the process of the creation and evaluation of the performance of a CNN architecture. A similar process can be followed in iterating to build better-performing CNN architectures.
Importing the required libraries for building and training our Model
# deep learning packages
import tensorflow as tf
import keras
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Conv2D, MaxPooling2D, BatchNormalization, Activation, Input, LeakyReLU, Dropout, Flatten
from tensorflow.keras import backend, losses, optimizers
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
One Hot Encoding the train and test labels
# one hot encode y_train
y_train_encode= to_categorical(y_train)
#check shape
y_train_encode.shape
(24958, 2)
# one hot encode y_test
y_test_encode= to_categorical(y_test)
# check shape
y_test_encode.shape
(2600, 2)
Building the model
# clear any old models
backend.clear_session()
# fix the seed for reproducibility
seed= 24
random.seed(seed)
np.random.seed(seed)
tf.random.set_seed(seed)
# base CNN model to compare other models to
base_model= Sequential()
base_model.add(Conv2D(64, kernel_size= (3,3), padding= 'same', activation= 'relu', input_shape= (64, 64, 3))) # first CNN layer
base_model.add(Conv2D(32, kernel_size= (3,3), padding= 'same', activation= 'relu')) # second CNN layer
base_model.add(MaxPooling2D(pool_size= (2,2))) # max pooling layer
base_model.add(Dropout(0.25)) # dropout layer
base_model.add(Flatten()) # flatten layer
base_model.add(Dense(16, activation= 'relu')) # dense layer
base_model.add(Dense(2, activation= 'softmax')) # classification/output layer
# look at model summary
base_model.summary()
Model: "sequential"
| Layer (type) | Output Shape | Param # |
|---|---|---|
| conv2d (Conv2D) | (None, 64, 64, 64) | 1,792 |
| conv2d_1 (Conv2D) | (None, 64, 64, 32) | 18,464 |
| max_pooling2d (MaxPooling2D) | (None, 32, 32, 32) | 0 |
| dropout (Dropout) | (None, 32, 32, 32) | 0 |
| flatten (Flatten) | (None, 32768) | 0 |
| dense (Dense) | (None, 16) | 524,304 |
| dense_1 (Dense) | (None, 2) | 34 |
Total params: 544,594 (2.08 MB)
Trainable params: 544,594 (2.08 MB)
Non-trainable params: 0 (0.00 B)
Compiling the model
opt= Adam(learning_rate= 0.001) # try using Adam optimizer
base_model.compile(optimizer= opt, loss= 'binary_crossentropy', metrics= ['accuracy']) # compile the model
Using Callbacks
# use checkpoints
es= EarlyStopping(monitor= 'val_loss', mode= 'min', verbose= 1, patience= 5) # define early stopping
mc= ModelCheckpoint('best_base_model.h5', monitor= 'val_accuracy', mode= 'max', verbose= 1, save_best_only= True) # save only best model
Fit and train our Model
# fit the model
history1= base_model.fit(x_train, y_train_encode, batch_size= 32, validation_split= 0.2, callbacks= [es, mc], verbose= 1, epochs= 30)
Epoch 1/30
624/624 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step - accuracy: 0.6417 - loss: 0.6348
Epoch 1: val_accuracy improved from -inf to 0.81190, saving model to best_base_model.h5
WARNING:absl:You are saving your model as an HDF5 file via `model.save()` or `keras.saving.save_model(model)`. This file format is considered legacy. We recommend using instead the native Keras format, e.g. `model.save('my_model.keras')` or `keras.saving.save_model(model, 'my_model.keras')`.
624/624 ━━━━━━━━━━━━━━━━━━━━ 17s 16ms/step - accuracy: 0.6418 - loss: 0.6348 - val_accuracy: 0.8119 - val_loss: 0.5248
Epoch 2/30
622/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.8850 - loss: 0.2940
Epoch 2: val_accuracy improved from 0.81190 to 0.96254, saving model to best_base_model.h5
624/624 ━━━━━━━━━━━━━━━━━━━━ 11s 9ms/step - accuracy: 0.8851 - loss: 0.2938 - val_accuracy: 0.9625 - val_loss: 0.1781
Epoch 3/30
623/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9314 - loss: 0.1813
Epoch 3: val_accuracy did not improve from 0.96254
624/624 ━━━━━━━━━━━━━━━━━━━━ 11s 11ms/step - accuracy: 0.9314 - loss: 0.1812 - val_accuracy: 0.9621 - val_loss: 0.1841
Epoch 4/30
620/624 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.9414 - loss: 0.1552
Epoch 4: val_accuracy improved from 0.96254 to 0.96715, saving model to best_base_model.h5
624/624 ━━━━━━━━━━━━━━━━━━━━ 9s 9ms/step - accuracy: 0.9414 - loss: 0.1551 - val_accuracy: 0.9671 - val_loss: 0.1498
Epoch 5/30
621/624 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.9475 - loss: 0.1319
Epoch 5: val_accuracy did not improve from 0.96715
624/624 ━━━━━━━━━━━━━━━━━━━━ 10s 9ms/step - accuracy: 0.9475 - loss: 0.1319 - val_accuracy: 0.9499 - val_loss: 0.2184
Epoch 6/30
618/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9520 - loss: 0.1216
Epoch 6: val_accuracy did not improve from 0.96715
624/624 ━━━━━━━━━━━━━━━━━━━━ 6s 9ms/step - accuracy: 0.9520 - loss: 0.1216 - val_accuracy: 0.9541 - val_loss: 0.2009
Epoch 7/30
624/624 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.9561 - loss: 0.1075
Epoch 7: val_accuracy did not improve from 0.96715
624/624 ━━━━━━━━━━━━━━━━━━━━ 6s 9ms/step - accuracy: 0.9561 - loss: 0.1075 - val_accuracy: 0.9491 - val_loss: 0.1953
Epoch 8/30
624/624 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.9632 - loss: 0.0925
Epoch 8: val_accuracy did not improve from 0.96715
624/624 ━━━━━━━━━━━━━━━━━━━━ 10s 9ms/step - accuracy: 0.9632 - loss: 0.0925 - val_accuracy: 0.9623 - val_loss: 0.1950
Epoch 9/30
618/624 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.9691 - loss: 0.0773
Epoch 9: val_accuracy did not improve from 0.96715
624/624 ━━━━━━━━━━━━━━━━━━━━ 6s 9ms/step - accuracy: 0.9691 - loss: 0.0773 - val_accuracy: 0.9651 - val_loss: 0.1572
Epoch 9: early stopping
Evaluating the model on test data
# load weights from best model
base_model.load_weights('/content/best_base_model.h5')
# make predictions with base model
pred_1= base_model.predict(x_test)
# convert prediction from one hot encode to single value
pred_1= np.argmax(pred_1, axis= 1)
82/82 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step
Plotting the confusion matrix
# classification report
print(classification_report(y_test, pred_1))
precision recall f1-score support
0 0.95 0.94 0.95 1300
1 0.94 0.95 0.95 1300
accuracy 0.95 2600
macro avg 0.95 0.95 0.95 2600
weighted avg 0.95 0.95 0.95 2600
# classification matrix
ConfusionMatrixDisplay.from_predictions(y_test, pred_1, display_labels= names1)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7fd180317410>
The base model performs well, with high recall and precision and an accuracy of 0.95. Most of the misclassifications (74 out of 135) predict uninfected when the cell is parasitized; these false negatives are the most severe errors because those patients wouldn't get treated for malaria.
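The false-negative count cited above can be pulled programmatically from the confusion matrix. A minimal sketch on made-up labels (substitute y_test and pred_1 to reproduce the real counts):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels standing in for y_test / pred_1
# (0 = parasitized, 1 = uninfected, as in the notebook's encoding).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 1])

cm = confusion_matrix(y_true, y_pred)  # rows = true, cols = predicted
false_neg = cm[0, 1]                   # parasitized predicted as uninfected
false_pos = cm[1, 0]                   # uninfected predicted as parasitized
print(f"missed parasitized cases: {false_neg}")
print(f"false alarms:             {false_pos}")
```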
Plotting the train and validation curves
# convert history to dataframe
hist1_dict= history1.history
hist1_df= pd.DataFrame(data= hist1_dict, columns= ['accuracy', 'loss', 'val_accuracy', 'val_loss'])
# make figure to compare training and validation accuracy for base_model
plt.plot(hist1_df['accuracy'], label= 'train') # plot training accuracy
plt.plot(hist1_df['val_accuracy'], label= 'validation') # plot validation accuracy
plt.title('Model accuracy for base_model') # set plot title
plt.ylabel('Accuracy') # set y axis label
plt.xlabel('Epochs') # set x axis label
plt.legend(loc= 'upper right', bbox_to_anchor= (1.3, 1)); # define and place legend
Training and validation accuracy are both high. The model fits the validation data well, with no sign of overfitting.
Now let's build another model with a few additional layers to see if we can improve performance, adding layers as needed and altering the activation functions.
Model 1
Trying to improve the performance of our model by adding new layers
# reset keras backend state
backend.clear_session()
# fix the seed for reproducibility
seed= 24
random.seed(seed)
np.random.seed(seed)
tf.random.set_seed(seed)
Building the Model
# CNN model 1
model_1= Sequential()
model_1.add(Conv2D(64, kernel_size= (3,3), padding= 'same', activation= 'relu', input_shape= (64, 64, 3))) # first CNN layer
model_1.add(Conv2D(32, kernel_size= (3,3), padding= 'same', activation= 'relu')) # second CNN layer
model_1.add(MaxPooling2D(pool_size= (2,2))) # max pooling layer
model_1.add(Conv2D(16, kernel_size= (3,3), padding= 'same', activation= 'relu')) # third CNN layer
model_1.add(MaxPooling2D(pool_size= (2,2))) # max pooling layer
model_1.add(Dropout(0.25)) # dropout layer
model_1.add(Conv2D(8, kernel_size= (3,3), padding= 'same', activation= 'relu')) # fourth CNN layer
model_1.add(MaxPooling2D(pool_size= (2,2))) # max pooling layer
model_1.add(Flatten()) # flatten layer
model_1.add(Dense(16, activation= 'relu')) # dense layer
model_1.add(Dropout(0.25)) # dropout layer
model_1.add(Dense(2, activation= 'softmax')) # classification/output layer
# look at summary of model 1
model_1.summary()
Model: "sequential"
| Layer (type) | Output Shape | Param # |
|---|---|---|
| conv2d (Conv2D) | (None, 64, 64, 64) | 1,792 |
| conv2d_1 (Conv2D) | (None, 64, 64, 32) | 18,464 |
| max_pooling2d (MaxPooling2D) | (None, 32, 32, 32) | 0 |
| conv2d_2 (Conv2D) | (None, 32, 32, 16) | 4,624 |
| max_pooling2d_1 (MaxPooling2D) | (None, 16, 16, 16) | 0 |
| dropout (Dropout) | (None, 16, 16, 16) | 0 |
| conv2d_3 (Conv2D) | (None, 16, 16, 8) | 1,160 |
| max_pooling2d_2 (MaxPooling2D) | (None, 8, 8, 8) | 0 |
| flatten (Flatten) | (None, 512) | 0 |
| dense (Dense) | (None, 16) | 8,208 |
| dropout_1 (Dropout) | (None, 16) | 0 |
| dense_1 (Dense) | (None, 2) | 34 |
Total params: 34,282 (133.91 KB)
Trainable params: 34,282 (133.91 KB)
Non-trainable params: 0 (0.00 B)
Compiling the model
opt2= Adam(learning_rate= 0.001) # define the optimizer
model_1.compile(optimizer= opt2, loss= 'binary_crossentropy', metrics= ['accuracy']) # compile the model
Using Callbacks
# use checkpoints
es= EarlyStopping(monitor= 'val_loss', mode= 'min', verbose= 1, patience= 5) # define early stopping
mc= ModelCheckpoint('best_model_1.h5', monitor= 'val_accuracy', mode= 'max', verbose= 1, save_best_only= True) # save only best model
Fit and Train the model
# fit the model
history2= model_1.fit(x_train, y_train_encode, batch_size= 32, validation_split= 0.2, callbacks= [es, mc], verbose= 1, epochs= 30)
Epoch 1/30 624/624 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.6579 - loss: 0.6138 Epoch 1: val_accuracy improved from -inf to 0.99299, saving model to best_model_1.h5
WARNING:absl:You are saving your model as an HDF5 file via `model.save()` or `keras.saving.save_model(model)`. This file format is considered legacy. We recommend using instead the native Keras format, e.g. `model.save('my_model.keras')` or `keras.saving.save_model(model, 'my_model.keras')`.
624/624 ━━━━━━━━━━━━━━━━━━━━ 15s 16ms/step - accuracy: 0.6580 - loss: 0.6136 - val_accuracy: 0.9930 - val_loss: 0.2040 Epoch 2/30 623/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9645 - loss: 0.1493 Epoch 2: val_accuracy improved from 0.99299 to 0.99339, saving model to best_model_1.h5
624/624 ━━━━━━━━━━━━━━━━━━━━ 6s 10ms/step - accuracy: 0.9645 - loss: 0.1492 - val_accuracy: 0.9934 - val_loss: 0.1028 Epoch 3/30 624/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9750 - loss: 0.1139 Epoch 3: val_accuracy did not improve from 0.99339 624/624 ━━━━━━━━━━━━━━━━━━━━ 10s 10ms/step - accuracy: 0.9750 - loss: 0.1139 - val_accuracy: 0.9892 - val_loss: 0.1475 Epoch 4/30 621/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9776 - loss: 0.1000 Epoch 4: val_accuracy did not improve from 0.99339 624/624 ━━━━━━━━━━━━━━━━━━━━ 6s 10ms/step - accuracy: 0.9776 - loss: 0.0999 - val_accuracy: 0.9874 - val_loss: 0.1000 Epoch 5/30 622/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9778 - loss: 0.0870 Epoch 5: val_accuracy did not improve from 0.99339 624/624 ━━━━━━━━━━━━━━━━━━━━ 10s 10ms/step - accuracy: 0.9778 - loss: 0.0870 - val_accuracy: 0.9840 - val_loss: 0.1126 Epoch 6/30 621/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9777 - loss: 0.0799 Epoch 6: val_accuracy did not improve from 0.99339 624/624 ━━━━━━━━━━━━━━━━━━━━ 10s 10ms/step - accuracy: 0.9777 - loss: 0.0799 - val_accuracy: 0.9876 - val_loss: 0.0774 Epoch 7/30 621/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9796 - loss: 0.0773 Epoch 7: val_accuracy did not improve from 0.99339 624/624 ━━━━━━━━━━━━━━━━━━━━ 10s 10ms/step - accuracy: 0.9796 - loss: 0.0772 - val_accuracy: 0.9894 - val_loss: 0.0656 Epoch 8/30 622/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9806 - loss: 0.0672 Epoch 8: val_accuracy did not improve from 0.99339 624/624 ━━━━━━━━━━━━━━━━━━━━ 6s 10ms/step - accuracy: 0.9806 - loss: 0.0671 - val_accuracy: 0.9886 - val_loss: 0.0801 Epoch 9/30 619/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9783 - loss: 0.0693 Epoch 9: val_accuracy did not improve from 0.99339 624/624 ━━━━━━━━━━━━━━━━━━━━ 6s 10ms/step - accuracy: 0.9783 - loss: 0.0692 - val_accuracy: 0.9876 - val_loss: 0.0650 Epoch 10/30 623/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9802 - loss: 0.0631 Epoch 
10: val_accuracy did not improve from 0.99339 624/624 ━━━━━━━━━━━━━━━━━━━━ 10s 10ms/step - accuracy: 0.9802 - loss: 0.0631 - val_accuracy: 0.9862 - val_loss: 0.0704 Epoch 11/30 619/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9790 - loss: 0.0608 Epoch 11: val_accuracy did not improve from 0.99339 624/624 ━━━━━━━━━━━━━━━━━━━━ 10s 10ms/step - accuracy: 0.9790 - loss: 0.0608 - val_accuracy: 0.9868 - val_loss: 0.0826 Epoch 12/30 621/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9792 - loss: 0.0609 Epoch 12: val_accuracy did not improve from 0.99339 624/624 ━━━━━━━━━━━━━━━━━━━━ 10s 10ms/step - accuracy: 0.9792 - loss: 0.0609 - val_accuracy: 0.9860 - val_loss: 0.0714 Epoch 13/30 619/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9807 - loss: 0.0570 Epoch 13: val_accuracy did not improve from 0.99339 624/624 ━━━━━━━━━━━━━━━━━━━━ 6s 10ms/step - accuracy: 0.9807 - loss: 0.0569 - val_accuracy: 0.9892 - val_loss: 0.0572 Epoch 14/30 621/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9823 - loss: 0.0532 Epoch 14: val_accuracy did not improve from 0.99339 624/624 ━━━━━━━━━━━━━━━━━━━━ 10s 10ms/step - accuracy: 0.9823 - loss: 0.0532 - val_accuracy: 0.9880 - val_loss: 0.0706 Epoch 15/30 620/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9819 - loss: 0.0564 Epoch 15: val_accuracy did not improve from 0.99339 624/624 ━━━━━━━━━━━━━━━━━━━━ 6s 10ms/step - accuracy: 0.9819 - loss: 0.0564 - val_accuracy: 0.9874 - val_loss: 0.0636 Epoch 16/30 623/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9811 - loss: 0.0535 Epoch 16: val_accuracy did not improve from 0.99339 624/624 ━━━━━━━━━━━━━━━━━━━━ 6s 10ms/step - accuracy: 0.9811 - loss: 0.0535 - val_accuracy: 0.9878 - val_loss: 0.0665 Epoch 17/30 621/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9787 - loss: 0.0581 Epoch 17: val_accuracy did not improve from 0.99339 624/624 ━━━━━━━━━━━━━━━━━━━━ 6s 10ms/step - accuracy: 0.9787 - loss: 0.0581 - val_accuracy: 0.9834 - val_loss: 0.0917 Epoch 18/30 619/624 
━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9820 - loss: 0.0524 Epoch 18: val_accuracy did not improve from 0.99339 624/624 ━━━━━━━━━━━━━━━━━━━━ 6s 10ms/step - accuracy: 0.9820 - loss: 0.0524 - val_accuracy: 0.9854 - val_loss: 0.0883 Epoch 18: early stopping
Evaluating the model
# load weights from best model
model_1.load_weights('/content/best_model_1.h5')
# make predictions with model_1
pred_2= model_1.predict(x_test)
# convert predicted class probabilities to a single class label
pred_2= np.argmax(pred_2, axis= 1)
82/82 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step
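The argmax step above turns each row of softmax probabilities into a class label. A tiny standalone illustration of the same operation:

```python
import numpy as np

# Each row holds the softmax output for one image: [P(class 0), P(class 1)].
probs = np.array([[0.90, 0.10],    # confident class 0 (parasitized)
                  [0.20, 0.80],    # confident class 1 (uninfected)
                  [0.55, 0.45]])   # borderline, resolves to class 0
labels = np.argmax(probs, axis=1)  # index of the largest probability per row
print(labels)  # → [0 1 0]
```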
Generating the classification report and confusion matrix
# classification report
print(classification_report(y_test, pred_2))
precision recall f1-score support
0 0.99 0.97 0.98 1300
1 0.97 0.99 0.98 1300
accuracy 0.98 2600
macro avg 0.98 0.98 0.98 2600
weighted avg 0.98 0.98 0.98 2600
# classification matrix
ConfusionMatrixDisplay.from_predictions(y_test, pred_2, display_labels= names1)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7fd175fbee90>
Model_1 improves on the base model: precision, recall, and accuracy are all higher. The number of parasitized cells misclassified as uninfected dropped significantly (40 for model_1). These are the worst errors, because those patients would go untreated for malaria. In total there were 51 misclassifications. Both ReLU and Leaky ReLU activations were tried; ReLU gave fewer misclassifications.
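The error tally above can be reproduced directly from the label arrays, without reading it off the confusion matrix plot. A small sketch (the helper name `count_misclassifications` is not part of the notebook), using the 0 = parasitized / 1 = uninfected encoding:

```python
import numpy as np

def count_misclassifications(y_true, y_pred):
    """Return (parasitized missed, total errors), with 0 = parasitized."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # parasitized cells predicted uninfected: the costly false negatives
    missed_parasitized = int(np.sum((y_true == 0) & (y_pred == 1)))
    total_errors = int(np.sum(y_true != y_pred))
    return missed_parasitized, total_errors

# e.g. count_misclassifications(y_test, pred_2) on the arrays above;
# a toy example here keeps the snippet self-contained
missed, total = count_misclassifications([0, 0, 0, 1, 1, 1],
                                         [0, 1, 0, 1, 0, 1])
print(missed, total)  # → 1 2
```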
Plotting the train and the validation curves
# make the history dataframe
hist2_dict= history2.history
hist2_df= pd.DataFrame(hist2_dict, columns= ['accuracy', 'loss', 'val_accuracy', 'val_loss'])
# make figure to compare training and validation accuracy for base_model
plt.plot(hist2_df['accuracy'], label= 'train') # plot training accuracy
plt.plot(hist2_df['val_accuracy'], label= 'validation') # plot validation accuracy
plt.title('Model accuracy for model_1') # set plot title
plt.ylabel('Accuracy') # set y axis label
plt.xlabel('Epochs') # set x axis label
plt.legend(loc= 'upper right', bbox_to_anchor= (1.3, 1)); # define and place legend
Training and validation accuracy are both high and track each other closely, so the model does not appear to be overfitting.
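Loss curves are worth plotting alongside accuracy: rising validation loss with flat validation accuracy is an earlier overfitting signal. A sketch following the same plotting pattern as above, with a tiny stand-in for `hist2_df` so it runs on its own:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Colab
import matplotlib.pyplot as plt
import pandas as pd

# stand-in for the hist2_df built from history2.history above
hist_df = pd.DataFrame({"loss": [0.61, 0.15, 0.11, 0.10],
                        "val_loss": [0.20, 0.10, 0.15, 0.10]})

plt.plot(hist_df["loss"], label="train")           # training loss per epoch
plt.plot(hist_df["val_loss"], label="validation")  # validation loss per epoch
plt.title("Model loss for model_1")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend(loc="upper right", bbox_to_anchor=(1.3, 1));
```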
Think about it:
- Can the model performance be improved if we change the activation function to LeakyReLU?
- Can BatchNormalization improve our model?
Let us build a model using BatchNormalization, with LeakyReLU as the activation function.
Model 2 with Batch Normalization
# reset keras backend state
backend.clear_session()
# fix the seed for reproducibility
seed= 24
random.seed(seed)
np.random.seed(seed)
tf.random.set_seed(seed)
Building the Model
# CNN model 2
model_2= Sequential()
model_2.add(Conv2D(64, kernel_size= (3,3), padding= 'same', input_shape= (64, 64, 3))) # first CNN layer
model_2.add(LeakyReLU(negative_slope= 0.1)) # leaky relu layer
model_2.add(Conv2D(32, kernel_size= (3,3), padding= 'same')) # second CNN layer
model_2.add(LeakyReLU(negative_slope= 0.1)) # leaky relu layer
model_2.add(MaxPooling2D(pool_size= (2,2))) # max pooling layer
model_2.add(Conv2D(16, kernel_size= (3,3), padding= 'same')) # third CNN layer
model_2.add(LeakyReLU(negative_slope= 0.1)) # leaky relu layer
model_2.add(MaxPooling2D(pool_size= (2,2))) # max pooling layer
model_2.add(Dropout(0.25)) # dropout layer
model_2.add(Conv2D(8, kernel_size= (3,3), padding= 'same')) # fourth CNN layer
model_2.add(LeakyReLU(negative_slope= 0.1)) # leaky relu layer
model_2.add(MaxPooling2D(pool_size= (2,2))) # max pooling layer
model_2.add(BatchNormalization()) # Batch Normalization layer
model_2.add(Flatten()) # flatten layer
model_2.add(Dense(16)) # dense layer
model_2.add(LeakyReLU(negative_slope= 0.1)) # leaky relu layer
model_2.add(Dropout(0.25)) # dropout layer
model_2.add(Dense(2, activation= 'softmax')) # classification/output layer
# summary of the model
model_2.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ conv2d (Conv2D) │ (None, 64, 64, 64) │ 1,792 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ leaky_re_lu (LeakyReLU) │ (None, 64, 64, 64) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ conv2d_1 (Conv2D) │ (None, 64, 64, 32) │ 18,464 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ leaky_re_lu_1 (LeakyReLU) │ (None, 64, 64, 32) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ max_pooling2d (MaxPooling2D) │ (None, 32, 32, 32) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ conv2d_2 (Conv2D) │ (None, 32, 32, 16) │ 4,624 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ leaky_re_lu_2 (LeakyReLU) │ (None, 32, 32, 16) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ max_pooling2d_1 (MaxPooling2D) │ (None, 16, 16, 16) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout (Dropout) │ (None, 16, 16, 16) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ conv2d_3 (Conv2D) │ (None, 16, 16, 8) │ 1,160 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ leaky_re_lu_3 (LeakyReLU) │ (None, 16, 16, 8) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ max_pooling2d_2 (MaxPooling2D) │ (None, 8, 8, 8) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ batch_normalization │ (None, 8, 8, 8) │ 32 │ │ (BatchNormalization) │ │ │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ flatten (Flatten) │ (None, 512) │ 0 │ 
├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense (Dense) │ (None, 16) │ 8,208 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ leaky_re_lu_4 (LeakyReLU) │ (None, 16) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout_1 (Dropout) │ (None, 16) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_1 (Dense) │ (None, 2) │ 34 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 34,314 (134.04 KB)
Trainable params: 34,298 (133.98 KB)
Non-trainable params: 16 (64.00 B)
Compiling the model
# compile the model
opt3= Adam(learning_rate= 0.001) # define the optimizer
model_2.compile(optimizer= opt3, loss= 'binary_crossentropy', metrics= ['accuracy'])
Using callbacks
# use checkpoints
es= EarlyStopping(monitor= 'val_loss', mode= 'min', verbose= 1, patience= 5) # define early stopping
mc= ModelCheckpoint('best_model_2.h5', monitor= 'val_accuracy', mode= 'max', verbose= 1, save_best_only= True) # save only best model
Fit and train the model
# fit the model
history3= model_2.fit(x_train, y_train_encode, batch_size= 32, validation_split= 0.2, callbacks= [es, mc], verbose= 1, epochs= 30)
Epoch 1/30 624/624 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.6408 - loss: 0.6524 Epoch 1: val_accuracy improved from -inf to 0.99459, saving model to best_model_2.h5
624/624 ━━━━━━━━━━━━━━━━━━━━ 15s 17ms/step - accuracy: 0.6409 - loss: 0.6523 - val_accuracy: 0.9946 - val_loss: 0.1360 Epoch 2/30 623/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9505 - loss: 0.1539 Epoch 2: val_accuracy did not improve from 0.99459 624/624 ━━━━━━━━━━━━━━━━━━━━ 7s 11ms/step - accuracy: 0.9505 - loss: 0.1539 - val_accuracy: 0.9868 - val_loss: 0.0657 Epoch 3/30 624/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9684 - loss: 0.1035 Epoch 3: val_accuracy did not improve from 0.99459 624/624 ━━━━━━━━━━━━━━━━━━━━ 10s 10ms/step - accuracy: 0.9684 - loss: 0.1035 - val_accuracy: 0.9908 - val_loss: 0.0479 Epoch 4/30 621/624 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.9722 - loss: 0.0899 Epoch 4: val_accuracy did not improve from 0.99459 624/624 ━━━━━━━━━━━━━━━━━━━━ 7s 10ms/step - accuracy: 0.9722 - loss: 0.0899 - val_accuracy: 0.9788 - val_loss: 0.0967 Epoch 5/30 624/624 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.9752 - loss: 0.0840 Epoch 5: val_accuracy did not improve from 0.99459 624/624 ━━━━━━━━━━━━━━━━━━━━ 10s 11ms/step - accuracy: 0.9752 - loss: 0.0840 - val_accuracy: 0.9904 - val_loss: 0.0482 Epoch 6/30 619/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9734 - loss: 0.0857 Epoch 6: val_accuracy did not improve from 0.99459 624/624 ━━━━━━━━━━━━━━━━━━━━ 6s 10ms/step - accuracy: 0.9734 - loss: 0.0857 - val_accuracy: 0.9876 - val_loss: 0.0622 Epoch 7/30 624/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9762 - loss: 0.0739 Epoch 7: val_accuracy did not improve from 0.99459 624/624 ━━━━━━━━━━━━━━━━━━━━ 10s 10ms/step - accuracy: 0.9762 - loss: 0.0739 - val_accuracy: 0.9898 - val_loss: 0.0430 Epoch 8/30 624/624 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.9776 - loss: 0.0712 Epoch 8: val_accuracy did not improve from 0.99459 624/624 ━━━━━━━━━━━━━━━━━━━━ 10s 11ms/step - accuracy: 0.9776 - loss: 0.0712 - val_accuracy: 0.9922 - val_loss: 0.0300 Epoch 9/30 624/624 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.9766 - loss: 0.0719 
Epoch 9: val_accuracy did not improve from 0.99459 624/624 ━━━━━━━━━━━━━━━━━━━━ 10s 11ms/step - accuracy: 0.9766 - loss: 0.0719 - val_accuracy: 0.9898 - val_loss: 0.0403 Epoch 10/30 623/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9780 - loss: 0.0664 Epoch 10: val_accuracy did not improve from 0.99459 624/624 ━━━━━━━━━━━━━━━━━━━━ 6s 10ms/step - accuracy: 0.9780 - loss: 0.0664 - val_accuracy: 0.9854 - val_loss: 0.0563 Epoch 11/30 619/624 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.9773 - loss: 0.0685 Epoch 11: val_accuracy did not improve from 0.99459 624/624 ━━━━━━━━━━━━━━━━━━━━ 7s 11ms/step - accuracy: 0.9773 - loss: 0.0685 - val_accuracy: 0.9840 - val_loss: 0.0642 Epoch 12/30 622/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9788 - loss: 0.0618 Epoch 12: val_accuracy did not improve from 0.99459 624/624 ━━━━━━━━━━━━━━━━━━━━ 6s 10ms/step - accuracy: 0.9788 - loss: 0.0618 - val_accuracy: 0.9856 - val_loss: 0.0525 Epoch 13/30 619/624 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9778 - loss: 0.0639 Epoch 13: val_accuracy did not improve from 0.99459 624/624 ━━━━━━━━━━━━━━━━━━━━ 7s 10ms/step - accuracy: 0.9778 - loss: 0.0639 - val_accuracy: 0.9914 - val_loss: 0.0349 Epoch 13: early stopping
Plotting the train and validation accuracy
# make the history dataframe
hist3_dict= history3.history
hist3_df= pd.DataFrame(hist3_dict, columns= ['accuracy', 'loss', 'val_accuracy', 'val_loss'])
# make figure to compare training and validation accuracy for base_model
plt.plot(hist3_df['accuracy'], label= 'train') # plot training accuracy
plt.plot(hist3_df['val_accuracy'], label= 'validation') # plot validation accuracy
plt.title('Model accuracy for model_2') # set plot title
plt.ylabel('Accuracy') # set y axis label
plt.xlabel('Epochs') # set x axis label
plt.legend(loc= 'upper right', bbox_to_anchor= (1.3, 1)); # define and place legend
Evaluating the model
# load weights from best model
model_2.load_weights('/content/best_model_2.h5')
# make predictions with model_2
pred_3= model_2.predict(x_test)
# convert predicted class probabilities to a single class label
pred_3= np.argmax(pred_3, axis= 1)
82/82 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step
Generate the classification report and confusion matrix
# classification report
print(classification_report(y_test, pred_3))
precision recall f1-score support
0 0.99 0.81 0.89 1300
1 0.84 0.99 0.91 1300
accuracy 0.90 2600
macro avg 0.91 0.90 0.90 2600
weighted avg 0.91 0.90 0.90 2600
# classification matrix
ConfusionMatrixDisplay.from_predictions(y_test, pred_3, display_labels= names1)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7fd1225c4b50>
Observations and insights:
Model_2 has nearly the same architecture as model_1, except that it uses Leaky ReLU activations and adds a batch normalization layer. Several positions for the batch normalization layer were tried; this placement gave the highest accuracy. Even so, model_2's test accuracy (0.90) is lower than model_1's, with 264 total misclassifications. The uninfected cells are predicted well, but many parasitized cells are misclassified as uninfected, which is the worst kind of error because those patients would go untreated for malaria.
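What the BatchNormalization layer computes at training time can be sketched in NumPy: each feature is standardized over the batch, then scaled and shifted by learned parameters (gamma and beta start at 1 and 0, as in Keras). This is a simplified sketch, ignoring the moving averages used at inference time:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-3):
    """Per-feature batch normalization (training-time behaviour)."""
    mean = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # standardize each feature
    return gamma * x_hat + beta              # learned scale and shift

batch = np.array([[1.0, 10.0],
                  [3.0, 20.0],
                  [5.0, 30.0]])
out = batch_norm(batch)
# each column now has mean ~0 and roughly unit variance
```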
Think About It:
- Can we improve the model with Image Data Augmentation?
Model 3 with Data Augmentation
# separate y_train and x_train into parasitized and uninfected
# uninfected only
y_train_1= y_train[y_train == 1]
y_train_1_idx= y_train_1.index # index of uninfected y_train
x_train_1= x_train[y_train_1_idx] # x_train array for uninfected
x_train_1= x_train_1 * 255 # rescale back to 0-255 pixel values before saving augmented images
#parasitized only
y_train_0= y_train[y_train == 0]
y_train_0_idx= y_train_0.index # index of parasitized y_train
x_train_0= x_train[y_train_0_idx] # x_train array for parasitized
x_train_0= x_train_0 * 255 # rescale back to 0-255 pixel values before saving augmented images
print('shape of x_train_1 uninfected is: ', x_train_1.shape)
print('shape of x_train_0 parasitized is: ', x_train_0.shape)
print('shape of y_train_1 uninfected is: ', y_train_1.shape)
print('shape of y_train_0 parasitized is: ', y_train_0.shape)
x_train_list= [x_train_0, x_train_1] # list with separate arrays for parasitized and uninfected
shape of x_train_1 uninfected is: (12376, 64, 64, 3)
shape of x_train_0 parasitized is: (12582, 64, 64, 3)
shape of y_train_1 uninfected is: (12376,)
shape of y_train_0 parasitized is: (12582,)
Use image data generator
# use ImageDataGenerator to make augmented images
new_path_1= '/content/drive/MyDrive/MIT/Capstone_project/aug_parasitized' # folder to save augmented parasitized images
new_path_2= '/content/drive/MyDrive/MIT/Capstone_project/aug_uninfected' # folder to save augmented uninfected images
# image data generator with rotation
datagen= ImageDataGenerator(rotation_range= 180, fill_mode= 'constant')
# loop to generate augmented images
for i, j in enumerate(x_train_list):
    count1= 0
    # make augmented images for parasitized
    if i == 0:
        for batch in datagen.flow(j, batch_size= 10, save_to_dir= new_path_1, save_prefix= 'parasitized', save_format= 'png'):
            count1= count1 + 1
            if count1 > 4:
                break
    # make augmented images for uninfected
    else:
        for batch in datagen.flow(j, batch_size= 10, save_to_dir= new_path_2, save_prefix= 'uninfected', save_format= 'png'):
            count1= count1 + 1
            if count1 > 4:
                break
# processing augmented images
# read in the augmented images and make labels
x_train_aug= [] # empty list to hold augmented arrays
y_train_aug= [] # empty list to hold labels
path1= Path('/content/drive/MyDrive/MIT/Capstone_project') # directory holding the augmented images
folders2= ['aug_parasitized', 'aug_uninfected'] # folders within directory to hold each type of image
for i in range(2): # loop to read in images, resize, store as arrays, and store labels
    newpath1= path1 / folders2[i]
    allfiles1= newpath1.glob("*") # get all files in folder
    for j in allfiles1:
        img1= cv2.imread(str(j)) # read files in folder
        img2= cv2.resize(img1, (64, 64)) # resize image
        x_train_aug.append(img2) # add array to list
        if i == 0:
            y_train_aug.append(0) # if image parasitized then add label= 0
        else:
            y_train_aug.append(1) # if image uninfected then add label= 1
# normalize x_train_aug, make array
x_train_aug= np.array(x_train_aug)
x_train_aug= x_train_aug/255
# convert list to series
y_train_aug= pd.Series(data= y_train_aug, name= 'Labels')
# check shape of x_train_aug and y_train_aug
print('shape of x_train_aug is ', x_train_aug.shape)
print('shape of y_train_aug is ', y_train_aug.shape)
shape of x_train_aug is (100, 64, 64, 3)
shape of y_train_aug is (100,)
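One way to fold the saved augmented arrays back into training would be simple concatenation; a sketch with shrunken stand-in arrays (the notebook instead applies augmentation on the fly with flow_from_directory below):

```python
import numpy as np
import pandas as pd

# stand-ins for x_train / y_train and x_train_aug / y_train_aug above
x_train = np.zeros((4, 64, 64, 3))
y_train = pd.Series([0, 0, 1, 1], name="Labels")
x_train_aug = np.ones((2, 64, 64, 3))
y_train_aug = pd.Series([0, 1], name="Labels")

x_full = np.concatenate([x_train, x_train_aug], axis=0)        # stack images
y_full = pd.concat([y_train, y_train_aug], ignore_index=True)  # stack labels
print(x_full.shape, y_full.shape)  # → (6, 64, 64, 3) (6,)
```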
Think About It :
- Check if the performance of the model can be improved by changing different parameters in the ImageDataGenerator.
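For experimenting with other ImageDataGenerator parameters (zoom_range, brightness_range, shear_range, and so on), it helps to remember that the underlying transforms are simple array operations, and that flips and rotations are label-preserving for roughly circular cell images. A NumPy sketch of the cheapest ones, on a random stand-in image:

```python
import numpy as np

rng = np.random.default_rng(24)
img = rng.random((64, 64, 3))  # stand-in for one normalized cell image

rot90 = np.rot90(img, k=1, axes=(0, 1))  # 90-degree rotation in the image plane
h_flip = img[:, ::-1, :]                 # horizontal flip
v_flip = img[::-1, :, :]                 # vertical flip
brighter = np.clip(img * 1.2, 0.0, 1.0)  # crude brightness scaling, kept in [0, 1]
```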
Visualizing Augmented images
# look at 10 images
randomlist3= np.random.randint(0, 50, size= 10) # make random list of images
count3= 0 # counter
# make figure with 10 images
fig30, ax30= plt.subplots(2, 5, figsize= (10, 4)) # define subplots
fig30.tight_layout() # use tight layout
for i in range(2): # loop for number of rows
    for j in range(5): # loop for number of columns
        ax30[i, j].imshow(x_train_aug[randomlist3[count3]]) # plot the images
        if y_train_aug[randomlist3[count3]] == 0: # check the label
            title30= 'parasitized'
        else:
            title30= 'uninfected'
        count3= count3 + 1 # update the counter
        ax30[i, j].set_title(title30) # update title with label
# look at 10 images
randomlist4= np.random.randint(51, 100, size= 10) # make random list of images
count4= 0 # counter
# make figure with 10 images
fig31, ax31= plt.subplots(2, 5, figsize= (10, 4)) # define figure subplots
fig31.tight_layout() # use tight layout
for i in range(2): # loop for number of rows
    for j in range(5): # loop for number of columns
        ax31[i, j].imshow(x_train_aug[randomlist4[count4]]) # plot the images
        if y_train_aug[randomlist4[count4]] == 0: # check the label
            title31= 'parasitized'
        else:
            title31= 'uninfected'
        count4= count4 + 1 # update the counter
        ax31[i, j].set_title(title31) # update title with label
Observations and insights:
ImageDataGenerator was used to create the augmented images. The loop above saved 50 augmented parasitized images and 50 augmented uninfected images (100 in total, matching the shapes printed above), each randomly rotated by up to 180 degrees with constant fill. For training, augmentation is instead applied on the fly with flow_from_directory so the full training set can be used.
# make the image data generator pull from training directory
train_dir= '/content/cell_images/train' # training directory containing folders: parasitized and uninfected
batch1= 100 # batch size
# make the training generator
datagen_train= ImageDataGenerator(rescale= 1.0/255, validation_split= 0.2, rotation_range= 180, fill_mode= 'constant')
generator_train= datagen_train.flow_from_directory(train_dir, seed= 24, target_size= (64, 64), batch_size= batch1, subset= 'training')
# make the validation generator
datagen_val= ImageDataGenerator(rescale= 1.0/255, validation_split= 0.2)
generator_val= datagen_val.flow_from_directory(train_dir, seed= 24, target_size= (64, 64), batch_size= batch1, subset= 'validation')
Found 19967 images belonging to 2 classes.
Found 4991 images belonging to 2 classes.
Building the Model
# reset keras backend state
backend.clear_session()
# fix the seed for reproducibility
seed= 24
random.seed(seed)
np.random.seed(seed)
tf.random.set_seed(seed)
# same architecture and layers as model_1, renamed model_aug
model_aug= Sequential()
model_aug.add(Conv2D(64, kernel_size= (3,3), padding= 'same', activation= 'relu', input_shape= (64, 64, 3))) # first CNN layer
model_aug.add(Conv2D(32, kernel_size= (3,3), padding= 'same', activation= 'relu')) # second CNN layer
model_aug.add(MaxPooling2D(pool_size= (2,2))) # max pooling layer
model_aug.add(Conv2D(16, kernel_size= (3,3), padding= 'same', activation= 'relu')) # third CNN layer
model_aug.add(MaxPooling2D(pool_size= (2,2))) # max pooling layer
model_aug.add(Dropout(0.25)) # dropout layer
model_aug.add(Conv2D(8, kernel_size= (3,3), padding= 'same', activation= 'relu')) # fourth CNN layer
model_aug.add(MaxPooling2D(pool_size= (2,2))) # max pooling layer
model_aug.add(Flatten()) # flatten layer
model_aug.add(Dense(16, activation= 'relu')) # dense layer
model_aug.add(Dropout(0.25)) # dropout layer
model_aug.add(Dense(2, activation= 'softmax')) # classification/output layer
# model_aug summary
model_aug.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ conv2d (Conv2D) │ (None, 64, 64, 64) │ 1,792 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ conv2d_1 (Conv2D) │ (None, 64, 64, 32) │ 18,464 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ max_pooling2d (MaxPooling2D) │ (None, 32, 32, 32) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ conv2d_2 (Conv2D) │ (None, 32, 32, 16) │ 4,624 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ max_pooling2d_1 (MaxPooling2D) │ (None, 16, 16, 16) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout (Dropout) │ (None, 16, 16, 16) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ conv2d_3 (Conv2D) │ (None, 16, 16, 8) │ 1,160 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ max_pooling2d_2 (MaxPooling2D) │ (None, 8, 8, 8) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ flatten (Flatten) │ (None, 512) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense (Dense) │ (None, 16) │ 8,208 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout_1 (Dropout) │ (None, 16) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_1 (Dense) │ (None, 2) │ 34 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 34,282 (133.91 KB)
Trainable params: 34,282 (133.91 KB)
Non-trainable params: 0 (0.00 B)
# compile the model
opt4= Adam(learning_rate= 0.001) # define the optimizer
model_aug.compile(optimizer= opt4, loss= 'binary_crossentropy', metrics= ['accuracy'])
Using Callbacks
# use checkpoints
es= EarlyStopping(monitor= 'val_loss', mode= 'min', verbose= 1, patience= 5) # define early stopping
mc= ModelCheckpoint('best_model_aug.h5', monitor= 'val_accuracy', mode= 'max', verbose= 1, save_best_only= True) # save only best model
Fit and Train the model
# fit the model
history4= model_aug.fit(generator_train, steps_per_epoch= 199, epochs= 30, validation_data= generator_val, validation_steps= 50, callbacks= [es, mc], verbose= 1)
Epoch 1/30 199/199 ━━━━━━━━━━━━━━━━━━━━ 0s 254ms/step - accuracy: 0.5589 - loss: 0.6820 Epoch 1: val_accuracy improved from -inf to 0.89000, saving model to best_model_aug.h5
199/199 ━━━━━━━━━━━━━━━━━━━━ 67s 300ms/step - accuracy: 0.5593 - loss: 0.6818 - val_accuracy: 0.8900 - val_loss: 0.4116 Epoch 2/30 1/199 ━━━━━━━━━━━━━━━━━━━━ 4s 24ms/step - accuracy: 0.8600 - loss: 0.4056 Epoch 2: val_accuracy improved from 0.89000 to 0.89902, saving model to best_model_aug.h5
199/199 ━━━━━━━━━━━━━━━━━━━━ 5s 25ms/step - accuracy: 0.8600 - loss: 0.4056 - val_accuracy: 0.8990 - val_loss: 0.4251 Epoch 3/30 199/199 ━━━━━━━━━━━━━━━━━━━━ 0s 198ms/step - accuracy: 0.8974 - loss: 0.3434 Epoch 3: val_accuracy improved from 0.89902 to 0.96975, saving model to best_model_aug.h5
199/199 ━━━━━━━━━━━━━━━━━━━━ 55s 221ms/step - accuracy: 0.8975 - loss: 0.3430 - val_accuracy: 0.9697 - val_loss: 0.0862 Epoch 4/30 1/199 ━━━━━━━━━━━━━━━━━━━━ 7s 36ms/step - accuracy: 0.9400 - loss: 0.1253 Epoch 4: val_accuracy did not improve from 0.96975 199/199 ━━━━━━━━━━━━━━━━━━━━ 5s 25ms/step - accuracy: 0.9400 - loss: 0.1253 - val_accuracy: 0.9693 - val_loss: 0.0865 Epoch 5/30 199/199 ━━━━━━━━━━━━━━━━━━━━ 0s 196ms/step - accuracy: 0.9704 - loss: 0.1258 Epoch 5: val_accuracy improved from 0.96975 to 0.97435, saving model to best_model_aug.h5
199/199 ━━━━━━━━━━━━━━━━━━━━ 82s 248ms/step - accuracy: 0.9705 - loss: 0.1258 - val_accuracy: 0.9744 - val_loss: 0.0893 Epoch 6/30 1/199 ━━━━━━━━━━━━━━━━━━━━ 7s 36ms/step - accuracy: 0.9800 - loss: 0.1393 Epoch 6: val_accuracy did not improve from 0.97435 199/199 ━━━━━━━━━━━━━━━━━━━━ 4s 22ms/step - accuracy: 0.9800 - loss: 0.1393 - val_accuracy: 0.9736 - val_loss: 0.0947 Epoch 7/30 199/199 ━━━━━━━━━━━━━━━━━━━━ 0s 195ms/step - accuracy: 0.9724 - loss: 0.1093 Epoch 7: val_accuracy did not improve from 0.97435 199/199 ━━━━━━━━━━━━━━━━━━━━ 44s 221ms/step - accuracy: 0.9725 - loss: 0.1093 - val_accuracy: 0.9738 - val_loss: 0.0832 Epoch 8/30 1/199 ━━━━━━━━━━━━━━━━━━━━ 7s 36ms/step - accuracy: 0.9900 - loss: 0.0702 Epoch 8: val_accuracy improved from 0.97435 to 0.97475, saving model to best_model_aug.h5
199/199 ━━━━━━━━━━━━━━━━━━━━ 4s 22ms/step - accuracy: 0.9900 - loss: 0.0702 - val_accuracy: 0.9748 - val_loss: 0.0812 Epoch 9/30 199/199 ━━━━━━━━━━━━━━━━━━━━ 0s 199ms/step - accuracy: 0.9735 - loss: 0.1047 Epoch 9: val_accuracy improved from 0.97475 to 0.97656, saving model to best_model_aug.h5
199/199 ━━━━━━━━━━━━━━━━━━━━ 45s 225ms/step - accuracy: 0.9735 - loss: 0.1047 - val_accuracy: 0.9766 - val_loss: 0.0723 Epoch 10/30 1/199 ━━━━━━━━━━━━━━━━━━━━ 7s 38ms/step - accuracy: 0.9900 - loss: 0.0669 Epoch 10: val_accuracy improved from 0.97656 to 0.97816, saving model to best_model_aug.h5
199/199 ━━━━━━━━━━━━━━━━━━━━ 5s 26ms/step - accuracy: 0.9900 - loss: 0.0669 - val_accuracy: 0.9782 - val_loss: 0.0688 Epoch 11/30 199/199 ━━━━━━━━━━━━━━━━━━━━ 0s 196ms/step - accuracy: 0.9780 - loss: 0.0913 Epoch 11: val_accuracy did not improve from 0.97816 199/199 ━━━━━━━━━━━━━━━━━━━━ 43s 218ms/step - accuracy: 0.9779 - loss: 0.0913 - val_accuracy: 0.9778 - val_loss: 0.0702 Epoch 12/30 1/199 ━━━━━━━━━━━━━━━━━━━━ 7s 36ms/step - accuracy: 0.9800 - loss: 0.1068 Epoch 12: val_accuracy did not improve from 0.97816 199/199 ━━━━━━━━━━━━━━━━━━━━ 5s 26ms/step - accuracy: 0.9800 - loss: 0.1068 - val_accuracy: 0.9780 - val_loss: 0.0698 Epoch 13/30 199/199 ━━━━━━━━━━━━━━━━━━━━ 0s 194ms/step - accuracy: 0.9774 - loss: 0.0860 Epoch 13: val_accuracy improved from 0.97816 to 0.98097, saving model to best_model_aug.h5
199/199 ━━━━━━━━━━━━━━━━━━━━ 43s 216ms/step - accuracy: 0.9774 - loss: 0.0860 - val_accuracy: 0.9810 - val_loss: 0.0597 Epoch 14/30 1/199 ━━━━━━━━━━━━━━━━━━━━ 7s 36ms/step - accuracy: 0.9700 - loss: 0.0491 Epoch 14: val_accuracy did not improve from 0.98097 199/199 ━━━━━━━━━━━━━━━━━━━━ 4s 21ms/step - accuracy: 0.9700 - loss: 0.0491 - val_accuracy: 0.9796 - val_loss: 0.0602 Epoch 15/30 199/199 ━━━━━━━━━━━━━━━━━━━━ 0s 202ms/step - accuracy: 0.9768 - loss: 0.0880 Epoch 15: val_accuracy did not improve from 0.98097 199/199 ━━━━━━━━━━━━━━━━━━━━ 45s 228ms/step - accuracy: 0.9768 - loss: 0.0880 - val_accuracy: 0.9804 - val_loss: 0.0618 Epoch 16/30 1/199 ━━━━━━━━━━━━━━━━━━━━ 7s 36ms/step - accuracy: 0.9900 - loss: 0.0312 Epoch 16: val_accuracy did not improve from 0.98097 199/199 ━━━━━━━━━━━━━━━━━━━━ 4s 22ms/step - accuracy: 0.9900 - loss: 0.0312 - val_accuracy: 0.9790 - val_loss: 0.0614 Epoch 17/30 199/199 ━━━━━━━━━━━━━━━━━━━━ 0s 197ms/step - accuracy: 0.9793 - loss: 0.0832 Epoch 17: val_accuracy did not improve from 0.98097 199/199 ━━━━━━━━━━━━━━━━━━━━ 76s 220ms/step - accuracy: 0.9793 - loss: 0.0832 - val_accuracy: 0.9798 - val_loss: 0.0593 Epoch 18/30 1/199 ━━━━━━━━━━━━━━━━━━━━ 7s 36ms/step - accuracy: 0.9800 - loss: 0.0865 Epoch 18: val_accuracy did not improve from 0.98097 199/199 ━━━━━━━━━━━━━━━━━━━━ 5s 26ms/step - accuracy: 0.9800 - loss: 0.0865 - val_accuracy: 0.9800 - val_loss: 0.0593 Epoch 19/30 199/199 ━━━━━━━━━━━━━━━━━━━━ 0s 198ms/step - accuracy: 0.9784 - loss: 0.0841 Epoch 19: val_accuracy did not improve from 0.98097 199/199 ━━━━━━━━━━━━━━━━━━━━ 78s 224ms/step - accuracy: 0.9784 - loss: 0.0841 - val_accuracy: 0.9794 - val_loss: 0.0597 Epoch 20/30 1/199 ━━━━━━━━━━━━━━━━━━━━ 7s 36ms/step - accuracy: 0.9700 - loss: 0.0518 Epoch 20: val_accuracy did not improve from 0.98097 199/199 ━━━━━━━━━━━━━━━━━━━━ 4s 22ms/step - accuracy: 0.9700 - loss: 0.0518 - val_accuracy: 0.9796 - val_loss: 0.0594 Epoch 21/30 199/199 ━━━━━━━━━━━━━━━━━━━━ 0s 198ms/step - accuracy: 
0.9805 - loss: 0.0774 Epoch 21: val_accuracy did not improve from 0.98097 199/199 ━━━━━━━━━━━━━━━━━━━━ 45s 224ms/step - accuracy: 0.9805 - loss: 0.0774 - val_accuracy: 0.9746 - val_loss: 0.0703 Epoch 22/30 1/199 ━━━━━━━━━━━━━━━━━━━━ 7s 37ms/step - accuracy: 0.9800 - loss: 0.0605 Epoch 22: val_accuracy did not improve from 0.98097 199/199 ━━━━━━━━━━━━━━━━━━━━ 5s 24ms/step - accuracy: 0.9800 - loss: 0.0605 - val_accuracy: 0.9742 - val_loss: 0.0718 Epoch 23/30 199/199 ━━━━━━━━━━━━━━━━━━━━ 0s 197ms/step - accuracy: 0.9777 - loss: 0.0806 Epoch 23: val_accuracy did not improve from 0.98097 199/199 ━━━━━━━━━━━━━━━━━━━━ 44s 219ms/step - accuracy: 0.9777 - loss: 0.0806 - val_accuracy: 0.9784 - val_loss: 0.0664 Epoch 23: early stopping
Evaluating the model
Plot the train and validation accuracy
# confirm the class labels from ImageDataGenerator
print(generator_train.class_indices)
{'parasitized': 0, 'uninfected': 1}
# make the history dataframe
hist4_dict= history4.history
hist4_df= pd.DataFrame(hist4_dict, columns= ['accuracy', 'loss', 'val_accuracy', 'val_loss'])
# make figure to compare training and validation accuracy for model_aug
plt.plot(hist4_df['accuracy'], label= 'train') # plot training accuracy
plt.plot(hist4_df['val_accuracy'], label= 'validation') # plot validation accuracy
plt.title('Model accuracy for model_aug \n with Data Augmentation') # set plot title
plt.ylabel('Accuracy') # set y axis label
plt.xlabel('Epochs') # set x axis label
plt.legend(loc= 'upper right', bbox_to_anchor= (1.3, 1)); # define and place legend
Plotting the classification report and confusion matrix
# load weights from best model
model_aug.load_weights('/content/best_model_aug.h5')
# make predictions with model_aug
pred_4= model_aug.predict(x_test)
# convert prediction from one hot encode to single value
pred_4= np.argmax(pred_4, axis= 1)
82/82 ━━━━━━━━━━━━━━━━━━━━ 1s 8ms/step
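As an aside, the `np.argmax` conversion above maps each row of softmax probabilities to a single class index. A minimal sketch with hypothetical probabilities (not actual model outputs):

```python
import numpy as np

# hypothetical softmax outputs for three test images
# columns: parasitized = 0, uninfected = 1
probs = np.array([[0.92, 0.08],
                  [0.10, 0.90],
                  [0.55, 0.45]])

labels = np.argmax(probs, axis=1)  # index of the larger probability per row
print(labels)  # [0 1 0]
```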
# classification report
print(classification_report(y_test, pred_4))
precision recall f1-score support
0 0.99 0.99 0.99 1300
1 0.99 0.99 0.99 1300
accuracy 0.99 2600
macro avg 0.99 0.99 0.99 2600
weighted avg 0.99 0.99 0.99 2600
# confusion matrix
ConfusionMatrixDisplay.from_predictions(y_test, pred_4, display_labels= names1)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7fd0f9b19f90>
Observations: This model, model_aug, has the same architecture as model_1 but is trained on images with data augmentation (rotation_range= 180). This model with augmented images has slightly higher accuracy than model_1, and had the fewest misclassifications of any model: 32. It misclassified slightly more parasitized cells as uninfected (19), but this count is still much lower than for any other model. Other data augmentations were attempted: combinations of rotation range with zoom range, width shift range, height shift range, brightness range and vertical flip. These combinations of augmentations gave lower accuracy than rotation_range= 180 alone.
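The misclassification counts quoted above are the off-diagonal entries of the confusion matrix. A quick check with counts reconstructed from those figures (the diagonal entries are inferred here assuming 1300 test images per class, so treat them as an illustration rather than the exact matrix):

```python
import numpy as np

# 2x2 confusion matrix reconstructed from the counts above
# rows = true class, cols = predicted class (0 = parasitized, 1 = uninfected)
cm = np.array([[1281, 19],   # 19 parasitized cells misclassified as uninfected
               [13, 1287]])  # 13 uninfected cells misclassified as parasitized

misclassified = cm.sum() - np.trace(cm)  # sum of off-diagonal entries
print(misclassified)  # 32
```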
Now, let us try to use a pretrained model like VGG16 and check how it performs on our data.
Pre-trained model (VGG16)¶
- Import the VGG16 network up to any layer you choose
- Add Fully Connected Layers on top of it
# reset keras backend state
backend.clear_session()
# fix the seed for reproducibility
seed= 24
random.seed(seed)
np.random.seed(seed)
tf.random.set_seed(seed)
# make the vgg layers
vgg= VGG16(input_shape= (224, 224, 3), weights= 'imagenet', include_top= False)
# don't train the vgg layers (freeze pretrained weights)
for layer in vgg.layers:
    layer.trainable= False
# make the model
vgg_model= Sequential() # make sequential model
vgg_model.add(vgg) # add vgg layers
vgg_model.add(Flatten()) # flatten CNN
vgg_model.add(Dense(2, activation= 'softmax')) # add classification/output layer
Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5 58889256/58889256 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
# look at the model summary
vgg_model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ vgg16 (Functional) │ (None, 7, 7, 512) │ 14,714,688 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ flatten (Flatten) │ (None, 25088) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense (Dense) │ (None, 2) │ 50,178 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 14,764,866 (56.32 MB)
Trainable params: 50,178 (196.01 KB)
Non-trainable params: 14,714,688 (56.13 MB)
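As a sanity check, the trainable-parameter count in the summary comes entirely from the Dense(2) head sitting on the flattened (7, 7, 512) VGG16 output:

```python
# Flatten turns the (7, 7, 512) feature map into a 25088-long vector;
# the Dense(2) layer then has 25088 weights per unit plus one bias each
flat = 7 * 7 * 512
trainable_params = flat * 2 + 2
print(flat, trainable_params)  # 25088 50178
```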
Compiling the model
# make the optimizer
opt5= Adam(learning_rate= 0.001)
# compile the model
vgg_model.compile(optimizer= opt5, loss= 'binary_crossentropy', metrics= ['accuracy'])
Using callbacks
# define callbacks: early stopping and model checkpoint
es= EarlyStopping(monitor= 'val_loss', mode= 'min', verbose= 1, patience= 5) # define early stopping
mc= ModelCheckpoint('best_model_vgg.h5', monitor= 'val_accuracy', mode= 'max', verbose= 1, save_best_only= True) # save only best model
Fit and Train the model
# setup image data generator
train_dir= '/content/cell_images/train' # training image directory
batch1= 100 # batch size
# make the training generator
datagen_train2= ImageDataGenerator(preprocessing_function= preprocess_input, rotation_range= 180, validation_split= 0.2)
generator_train2= datagen_train2.flow_from_directory(train_dir, seed= 24, target_size= (224, 224), batch_size= batch1, subset= 'training')
# make the validation generator
datagen_val2= ImageDataGenerator(preprocessing_function= preprocess_input, validation_split= 0.2)
generator_val2= datagen_val2.flow_from_directory(train_dir, seed= 24, target_size= (224, 224), batch_size= batch1, subset= 'validation')
Found 19967 images belonging to 2 classes. Found 4991 images belonging to 2 classes.
history5= vgg_model.fit(generator_train2, steps_per_epoch= 199, epochs= 30, validation_data= generator_val2, validation_steps= 50, callbacks= [es, mc], verbose= 1)
Epoch 1/30 199/199 ━━━━━━━━━━━━━━━━━━━━ 0s 1s/step - accuracy: 0.8677 - loss: 1.6860 Epoch 1: val_accuracy improved from -inf to 0.94931, saving model to best_model_vgg.h5
199/199 ━━━━━━━━━━━━━━━━━━━━ 348s 2s/step - accuracy: 0.8681 - loss: 1.6813 - val_accuracy: 0.9493 - val_loss: 0.4141 Epoch 2/30 1/199 ━━━━━━━━━━━━━━━━━━━━ 1:27 444ms/step - accuracy: 0.9900 - loss: 0.1559 Epoch 2: val_accuracy improved from 0.94931 to 0.94951, saving model to best_model_vgg.h5
199/199 ━━━━━━━━━━━━━━━━━━━━ 23s 115ms/step - accuracy: 0.9900 - loss: 0.1559 - val_accuracy: 0.9495 - val_loss: 0.4311 Epoch 3/30 199/199 ━━━━━━━━━━━━━━━━━━━━ 0s 1s/step - accuracy: 0.9572 - loss: 0.3890 Epoch 3: val_accuracy improved from 0.94951 to 0.95091, saving model to best_model_vgg.h5
199/199 ━━━━━━━━━━━━━━━━━━━━ 289s 1s/step - accuracy: 0.9572 - loss: 0.3889 - val_accuracy: 0.9509 - val_loss: 0.5113 Epoch 4/30 1/199 ━━━━━━━━━━━━━━━━━━━━ 1:29 451ms/step - accuracy: 0.9500 - loss: 0.4228 Epoch 4: val_accuracy improved from 0.95091 to 0.95592, saving model to best_model_vgg.h5
199/199 ━━━━━━━━━━━━━━━━━━━━ 23s 116ms/step - accuracy: 0.9500 - loss: 0.4228 - val_accuracy: 0.9559 - val_loss: 0.4097 Epoch 5/30 199/199 ━━━━━━━━━━━━━━━━━━━━ 0s 1s/step - accuracy: 0.9567 - loss: 0.4651 Epoch 5: val_accuracy did not improve from 0.95592 199/199 ━━━━━━━━━━━━━━━━━━━━ 277s 1s/step - accuracy: 0.9567 - loss: 0.4651 - val_accuracy: 0.9519 - val_loss: 0.5669 Epoch 6/30 1/199 ━━━━━━━━━━━━━━━━━━━━ 1:29 450ms/step - accuracy: 0.9500 - loss: 0.7239 Epoch 6: val_accuracy improved from 0.95592 to 0.96033, saving model to best_model_vgg.h5
199/199 ━━━━━━━━━━━━━━━━━━━━ 23s 115ms/step - accuracy: 0.9500 - loss: 0.7239 - val_accuracy: 0.9603 - val_loss: 0.4227 Epoch 7/30 199/199 ━━━━━━━━━━━━━━━━━━━━ 0s 1s/step - accuracy: 0.9592 - loss: 0.5475 Epoch 7: val_accuracy did not improve from 0.96033 199/199 ━━━━━━━━━━━━━━━━━━━━ 277s 1s/step - accuracy: 0.9592 - loss: 0.5474 - val_accuracy: 0.9595 - val_loss: 0.5378 Epoch 8/30 1/199 ━━━━━━━━━━━━━━━━━━━━ 1:27 444ms/step - accuracy: 0.9200 - loss: 0.7880 Epoch 8: val_accuracy did not improve from 0.96033 199/199 ━━━━━━━━━━━━━━━━━━━━ 23s 114ms/step - accuracy: 0.9200 - loss: 0.7880 - val_accuracy: 0.9587 - val_loss: 0.6268 Epoch 9/30 199/199 ━━━━━━━━━━━━━━━━━━━━ 0s 1s/step - accuracy: 0.9607 - loss: 0.5353 Epoch 9: val_accuracy did not improve from 0.96033 199/199 ━━━━━━━━━━━━━━━━━━━━ 317s 1s/step - accuracy: 0.9607 - loss: 0.5352 - val_accuracy: 0.9263 - val_loss: 0.7615 Epoch 9: early stopping
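The "Epoch 9: early stopping" message is consistent with patience= 5 on val_loss: the best val_loss (0.4097) occurred at epoch 4, and epochs 5 through 9 brought no improvement. A minimal sketch of the patience logic (an illustration, not Keras' actual implementation):

```python
def stopped_epoch(val_losses, patience):
    """Return the 1-based epoch at which patience-based early stopping triggers."""
    best, wait = float('inf'), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0  # improvement: reset the patience counter
        else:
            wait += 1             # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses)        # never triggered: trained all epochs

# val_loss values from the training log above
losses = [0.4141, 0.4311, 0.5113, 0.4097, 0.5669, 0.4227, 0.5378, 0.6268, 0.7615]
print(stopped_epoch(losses, patience=5))  # 9
```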
Plot the train and validation accuracy
print(generator_train2.class_indices)
{'parasitized': 0, 'uninfected': 1}
hist5_dict= history5.history
hist5_df= pd.DataFrame(data= hist5_dict, columns= ['accuracy', 'loss', 'val_accuracy', 'val_loss'])
# make figure to compare training and validation accuracy for vgg_model
plt.plot(hist5_df['accuracy'], label= 'train') # plot training accuracy
plt.plot(hist5_df['val_accuracy'], label= 'validation') # plot validation accuracy
plt.title('Model accuracy for model_vgg \n Transfer Learning') # set plot title
plt.ylabel('Accuracy') # set y axis label
plt.xlabel('Epochs') # set x axis label
plt.legend(loc= 'upper right', bbox_to_anchor= (1.3, 1)); # define and place legend
Observations and insights: Both the training and validation accuracies are high, meaning the model isn't overfitting. The model reaches high validation accuracy in the first epoch, but validation accuracy improves little in later epochs.¶
Evaluating the model
# load weights from best model
vgg_model.load_weights('/content/best_model_vgg.h5')
# resize x_test images and save to list
# create labels (y_test)
test_dir= "/content/cell_images/test" # filepath to test folder
x_test2= [] # list to hold test images
y_test2= [] # list to hold test labels
# for loop to store images in x_test2 and labels in y_test2
for i in folders:
    newpath2= Path(test_dir) / i # combined filepath of test directory and folder: parasitized or uninfected
    files2= newpath2.glob('*') # get all files within filepath
    for j in files2:
        img3= cv2.imread(str(j)) # read in image from folder
        img4= cv2.resize(img3, (224, 224)) # resize image to 224 x 224 (matches VGG16 input size)
        x_test2.append(img4) # add resized image to list
        if i == folders[0]: # if parasitized, y_test2 label= 0
            y_test2.append(0)
        else: # if uninfected, y_test2 label= 1
            y_test2.append(1)
x_test2= np.array(x_test2)
y_test2= pd.Series(data= y_test2, name= 'Labels')
# make predictions
pred_5= vgg_model.predict(x_test2)
# convert from one hot encode to single value
pred_5= np.argmax(pred_5, axis= 1)
82/82 ━━━━━━━━━━━━━━━━━━━━ 30s 221ms/step
Plotting the classification report and confusion matrix
# classification report
print(classification_report(y_test2, pred_5))
precision recall f1-score support
0 0.95 0.97 0.96 1300
1 0.97 0.95 0.96 1300
accuracy 0.96 2600
macro avg 0.96 0.96 0.96 2600
weighted avg 0.96 0.96 0.96 2600
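For reference, the per-class figures in the report follow directly from confusion-matrix counts. A sketch for the parasitized class (0), using hypothetical counts that are roughly consistent with the report above (the tp/fn/fp values are assumptions, not the actual counts):

```python
# hypothetical counts for the parasitized class (0): 1300 true cells total
tp, fn, fp = 1261, 39, 66

precision = tp / (tp + fp)  # fraction of predicted-parasitized that are correct
recall = tp / (tp + fn)     # fraction of true parasitized that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.95 0.97 0.96
```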
# confusion matrix
ConfusionMatrixDisplay.from_predictions(y_test2, pred_5, display_labels= names1)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7fd0f990ae90>
Observations: The transfer learning model with VGG16 has high accuracy: 0.96. This model was run with all VGG16 convolutional layers frozen, using ImageNet weights. The training data had the same data augmentation as the previous model (rotation_range= 180). The data augmentation model without transfer learning performed better than the model that used transfer learning from VGG16.
Think about it:¶
- What observations and insights can be drawn from the confusion matrix and classification report?
- Choose the model with the best accuracy scores from all the above models and save it as a final model.
Observations and Conclusions drawn from the final model: The model with the best performance was the data augmentation model (Model 3 with Data Augmentation). This model had the highest accuracy: 0.99. The data augmentation model also had the fewest misclassifications of parasitized as uninfected, which is the most severe error.
# classification reports
print('Base_Model classification report')
print(classification_report(y_test, pred_1))
print("-" * 100)
print('Model_1 classification report')
print(classification_report(y_test, pred_2))
print("-" * 100)
print('Model_2 classification report')
print(classification_report(y_test, pred_3))
print("-" * 100)
print('Data augmentation model (model_aug) classification report')
print(classification_report(y_test, pred_4))
print("-" *100)
print('Transfer learning model (vgg_model) classification report')
print(classification_report(y_test, pred_5))
Base_Model classification report
precision recall f1-score support
0 0.95 0.94 0.95 1300
1 0.94 0.95 0.95 1300
accuracy 0.95 2600
macro avg 0.95 0.95 0.95 2600
weighted avg 0.95 0.95 0.95 2600
----------------------------------------------------------------------------------------------------
Model_1 classification report
precision recall f1-score support
0 0.99 0.97 0.98 1300
1 0.97 0.99 0.98 1300
accuracy 0.98 2600
macro avg 0.98 0.98 0.98 2600
weighted avg 0.98 0.98 0.98 2600
----------------------------------------------------------------------------------------------------
Model_2 classification report
precision recall f1-score support
0 0.99 0.81 0.89 1300
1 0.84 0.99 0.91 1300
accuracy 0.90 2600
macro avg 0.91 0.90 0.90 2600
weighted avg 0.91 0.90 0.90 2600
----------------------------------------------------------------------------------------------------
Data augmentation model (model_aug) classification report
precision recall f1-score support
0 0.99 0.99 0.99 1300
1 0.99 0.99 0.99 1300
accuracy 0.99 2600
macro avg 0.99 0.99 0.99 2600
weighted avg 0.99 0.99 0.99 2600
----------------------------------------------------------------------------------------------------
Transfer learning model (vgg_model) classification report
precision recall f1-score support
0 0.95 0.97 0.96 1300
1 0.97 0.95 0.96 1300
accuracy 0.96 2600
macro avg 0.96 0.96 0.96 2600
weighted avg 0.96 0.96 0.96 2600
Insights¶
Refined insights:¶
- What are the most meaningful insights from the data relevant to the problem?
The parasitized red blood cells have at least one darker spot. The model needs to be able to recognize these darker spots.
Comparison of various techniques and their relative performance:¶
- How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?
All of the classification reports for the 5 models are printed above. The models had different architectures: different numbers of layers, different activation functions, some used data augmentation, and one used transfer learning with VGG16. The data augmentation model had the highest accuracy and the fewest misclassifications of parasitized as uninfected, which is the most severe error. It is most important to minimize the number of parasitized cells misclassified as uninfected, so that more infected patients are correctly identified and receive treatment.
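The comparison can also be summarized programmatically. A minimal sketch using the test accuracies from the classification reports printed above:

```python
# test accuracies taken from the classification reports above
results = {
    'base_model': 0.95,
    'model_1': 0.98,
    'model_2': 0.90,
    'model_aug': 0.99,
    'vgg_model': 0.96,
}

best = max(results, key=results.get)  # model name with the highest accuracy
print(best, results[best])  # model_aug 0.99
```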
Proposal for the final solution design:¶
- What model do you propose to be adopted? Why is this the best solution to adopt?
I propose that model_aug (model 3 with data augmentation) be adopted. This model had very high accuracy (0.99). Additionally, model_aug had the fewest misclassifications of parasitized as uninfected, which is the most important error to minimize: if parasitized cells are misclassified as uninfected, the patient will not be treated for malaria.